INTRODUCTION¶
Speaker recognition and speech recognition are two distinct but related fields within the broader domain of audio signal processing and analysis. Here are the key differences between them:
Objective:
- Speaker Recognition: The primary goal of speaker recognition is to identify or verify the identity of a speaker based on their unique vocal characteristics, often referred to as speaker "voiceprints" or "biometric signatures." It is a form of biometric authentication.
- Speech Recognition: The main objective of speech recognition is to convert spoken language into text or other forms of commands. Speech recognition systems analyze audio signals to understand and transcribe the spoken words.
Focus:
- Speaker Recognition: Focuses on the unique characteristics of an individual's voice, such as pitch, tone, accent, and speech patterns, to establish the speaker's identity.
- Speech Recognition: Focuses on understanding and interpreting the linguistic content of spoken words, regardless of the speaker. It involves converting spoken language into a textual representation.
Applications:
- Speaker Recognition: Commonly used in security systems, access control, and authentication applications where the identity of the speaker needs to be verified.
- Speech Recognition: Applied in various fields, including voice-activated assistants, transcription services, voice commands in smart devices, and interactive voice response (IVR) systems.
Challenges:
- Speaker Recognition: Faces challenges such as variations in voice due to health, emotional state, or environmental conditions. It also needs to account for potential attempts at voice impersonation.
- Speech Recognition: Challenges include dealing with variations in accents, background noise, and context-dependent language understanding. It requires sophisticated natural language processing (NLP) techniques.
Techniques:
- Speaker Recognition: Uses techniques such as speaker verification (confirming identity) and speaker identification (naming the speaker) based on feature extraction and pattern matching.
- Speech Recognition: Utilizes techniques such as Hidden Markov Models (HMMs), deep neural networks (DNNs), and recurrent neural networks (RNNs) for acoustic modeling and language modeling.
Output:
- Speaker Recognition: Outputs the identity or verification result of the speaker.
- Speech Recognition: Outputs the transcribed text or recognized spoken commands.
While speaker and speech recognition have distinct objectives, they can complement each other in applications where both speaker identity and spoken content need to be considered, such as in security systems with voice commands.
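The verification/identification split described above can be sketched with toy embeddings. Everything here is hypothetical (the 4-dimensional vectors, the names, and the 0.7 threshold); real systems compare high-dimensional voiceprints, such as the 512-dimensional x-vectors used later in this notebook.

```python
import numpy as np

# Toy speaker embeddings (hypothetical 4-dim "voiceprints").
enrolled = {
    "alice": np.array([0.9, 0.1, 0.0, 0.1]),
    "bob":   np.array([0.1, 0.8, 0.3, 0.0]),
}
probe = np.array([0.85, 0.15, 0.05, 0.1])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Verification: is the probe the claimed speaker? (threshold is arbitrary here)
claimed = "alice"
accept = cosine(probe, enrolled[claimed]) > 0.7

# Identification: which enrolled speaker is the closest match?
best = max(enrolled, key=lambda name: cosine(probe, enrolled[name]))

print(accept, best)  # True alice
```

Verification is a binary accept/reject decision against one claimed identity, while identification searches over all enrolled speakers.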
Problem Statement: In this project, we have 1,500 one-second WAV files for each of 5 speakers. We will train an RNN (Recurrent Neural Network) model to predict which speaker produced a given clip. There are also background-noise files in separate folders that could be used to generalize the model further; I will implement the model without using them. Recurrent Neural Networks (RNNs) are well suited to speaker recognition because they model sequential dependencies and capture temporal patterns in data.
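As a quick sanity check on the sequence length the RNN will see: assuming librosa's default framing (hop of 512 samples, centered frames), a one-second 16 kHz clip yields 32 MFCC frames, which matches the `(samples, 32, 13)` feature shape printed later in the notebook.

```python
# Why a 1-second, 16 kHz clip becomes a 32-step sequence for the RNN
# (assuming librosa's defaults: hop_length=512 samples, centered frames).
sr = 16000          # samples per second
hop_length = 512    # samples between successive analysis frames
n_samples = sr * 1  # one second of audio

# With centered framing, librosa emits 1 + floor(n_samples / hop_length) frames.
n_frames = 1 + n_samples // hop_length
print(n_frames)  # 32 timesteps, each described by 13 MFCCs -> shape (32, 13)
```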
IMPORTING LIBRARIES¶
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import shutil
import matplotlib.pyplot as plt
# Output directory to clear
output_dir = "/kaggle/working/combined_files"
# Clear the contents of the output directory
shutil.rmtree(output_dir, ignore_errors=True)
os.makedirs(output_dir, exist_ok=True)
print(f"Contents of {output_dir} cleared.")
Contents of /kaggle/working/combined_files cleared.
Create combined files for each speaker¶
Using the librosa and soundfile packages to create the combined files. I am only taking the first 120 files from each speaker folder to create a 2-minute-long snippet of each speaker's speech.
librosa is a Python package for music and audio analysis. It provides tools to analyze and visualize audio data, including functions for feature extraction, time-series representation, and visualization of audio signals. It is commonly used in the field of music information retrieval and audio signal processing.
soundfile is a Python library for reading and writing sound files. It provides an easy-to-use interface for working with audio files, supporting various formats such as WAV, FLAC, and OGG. soundfile is often used in conjunction with librosa when working with audio data, as it helps to load and save audio files efficiently.
import librosa
import soundfile as sf
# Path to the dataset
dataset_path = "/kaggle/input/speaker-recognition-dataset/16000_pcm_speeches"
# Output directory to save the combined files
output_dir = "/kaggle/working/combined_files"
# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)
# List of speaker folders
speaker_folders = [
    "Benjamin_Netanyau",
    "Jens_Stoltenberg",
    "Julia_Gillard",
    "Magaret_Tarcher",
    "Nelson_Mandela"
]
# Number of files to combine for each speaker
num_files_to_combine = 120
# Iterate over each speaker's folder
for speaker_folder in speaker_folders:
    speaker_folder_path = os.path.join(dataset_path, speaker_folder)
    # List the first num_files_to_combine WAV files in the speaker's folder
    wav_files = [f"{i}.wav" for i in range(num_files_to_combine)]
    # Combine all WAV files into a single long file
    combined_audio = []
    for wav_file in wav_files:
        wav_file_path = os.path.join(speaker_folder_path, wav_file)
        audio, sr = librosa.load(wav_file_path, sr=None)
        combined_audio.extend(audio)
    # Save the combined audio file
    output_file_path = os.path.join(output_dir, f"{speaker_folder}_combined.wav")
    sf.write(output_file_path, combined_audio, sr)
print("Combination complete. Combined files saved in:", output_dir)
Combination complete. Combined files saved in: /kaggle/working/combined_files
IPython is an interactive command-line shell for Python. It provides an enhanced interactive environment for Python programming and is particularly popular among data scientists, researchers, and engineers working in scientific computing, data analysis, and machine learning.
from IPython.display import display, Audio
# Function to play audio file
def play_audio(audio_path):
    display(Audio(filename=audio_path))
# Play a specific combined audio file
speaker_folder = "Benjamin_Netanyau_combined"
audio_path = os.path.join(output_dir, f"{speaker_folder}.wav")
print(f"Click the play button to listen: {audio_path}")
play_audio(audio_path)
Click the play button to listen: /kaggle/working/combined_files/Benjamin_Netanyau_combined.wav
# Function to play audio file
def play_audio(audio_path):
    display(Audio(filename=audio_path))
# Play a specific combined audio file
speaker_folder = "Nelson_Mandela_combined"
audio_path = os.path.join(output_dir, f"{speaker_folder}.wav")
print(f"Click the play button to listen: {audio_path}")
play_audio(audio_path)
Click the play button to listen: /kaggle/working/combined_files/Nelson_Mandela_combined.wav
DATA VISUALIZATIONS¶
A waveform and spectrogram are two types of visual representations of audio signals:
Waveform:
- Representation: A waveform is a time-domain representation of an audio signal.
- Axis: The x-axis represents time, and the y-axis represents the amplitude (loudness) of the signal at each point in time.
- Features: It shows how the amplitude of the audio signal changes over time.
- Interpretation: Peaks and valleys in the waveform correspond to changes in air pressure, which are perceived as sound.
Spectrogram:
- Representation: A spectrogram is a frequency-domain representation of an audio signal.
- Axis: The x-axis represents time, the y-axis represents frequency, and the color/intensity represents the magnitude (energy) of frequencies at different times.
- Features: It provides a 2D representation of how the frequency content of the signal changes over time.
- Interpretation: Brighter (higher-intensity) regions in the spectrogram indicate strong energy at certain frequencies at specific times, while dark regions indicate little energy.
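The core idea behind a spectrogram can be sketched in a few lines of NumPy: slice the signal into short frames and take the magnitude spectrum of each frame. This is a simplified, non-overlapping version of the STFT; `librosa.stft`, used later in this notebook, adds overlap and padding on top of the same idea.

```python
import numpy as np

# A synthetic test signal: 1 second of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

frame_len = 1024
n_frames = len(signal) // frame_len
# Split into non-overlapping frames and take the magnitude spectrum of each.
frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
spec = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))

# Every time frame should peak near the 440 Hz bin.
peak_bin = int(np.argmax(spec[0]))
peak_hz = peak_bin * sr / frame_len
print(spec.shape, peak_hz)  # (15, 513) frames-by-frequency, peak near 440 Hz
```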
The Mel-Frequency Cepstral Coefficients (MFCCs) plot is a representation of the audio signal in the frequency domain. MFCCs are coefficients that collectively represent the short-term power spectrum of a sound signal. The plot visualizes the variations in the spectral content of the audio signal over time.
Here's what the MFCC plot can show:
Time vs. Frequency: The x-axis represents time, and the y-axis represents different frequency bands. Each column in the plot corresponds to a short segment of time, and the height of the plot at a particular frequency band represents the magnitude or intensity of the signal in that frequency range during that time segment.
Feature Extraction: MFCCs are used as features for audio processing tasks. Each row in the plot corresponds to one of the MFCC coefficients. These coefficients capture important characteristics of the audio signal, such as the shape of the vocal tract, which is useful for tasks like speech and audio recognition.
Spectral Characteristics: Peaks and patterns in the plot can indicate specific frequencies or patterns in the audio signal. For example, formants in speech can be identified as concentrations of energy at specific frequencies.
Analysis of Sound Patterns: By observing how the MFCCs change over time, you can analyze sound patterns, distinguish between different sounds, and extract features for use in machine learning models for tasks like speaker identification, emotion recognition, or speech-to-text.
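The last stage of MFCC extraction can be illustrated directly: a DCT-II applied to log mel-band energies compresses each frame's spectral envelope into a handful of coefficients. This is only a sketch of that final step on toy, randomly generated "log-mel" values; the notebook's actual features come from `librosa.feature.mfcc`.

```python
import numpy as np

# DCT-II of log mel energies: the step that turns mel bands into MFCCs.
n_mels, n_mfcc = 20, 13
rng = np.random.default_rng(0)
log_mel = rng.normal(size=(n_mels, 50))  # (mel bands, time frames), toy data

# Orthonormal DCT-II basis (first n_mfcc rows of the full transform).
n = np.arange(n_mels)
basis = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * n[:n_mfcc, None])
basis[0] *= 1 / np.sqrt(n_mels)
basis[1:] *= np.sqrt(2 / n_mels)

mfcc = basis @ log_mel  # (13 coefficients, 50 frames)
print(mfcc.shape)       # same layout librosa returns: (n_mfcc, frames)
```

Keeping only the first 13 coefficients discards fine spectral detail while retaining the envelope shape, which is why MFCCs work well as compact speaker features.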
We will look at these plots for a few speakers.
import librosa.display
# Function to plot the waveform, spectrogram, and MFCCs
def plot_audio_features(audio_path):
    # Load audio file
    y, sr = librosa.load(audio_path, sr=None)
    # Extract speaker name from the file path
    speaker_name = os.path.basename(audio_path).split('_')[0]
    # Plot the waveform
    plt.figure(figsize=(15, 10))
    plt.subplot(3, 1, 1)
    librosa.display.waveshow(y, sr=sr)
    plt.title(f'Waveform - {speaker_name}')
    # Plot the spectrogram (take the magnitude of the STFT before converting to dB)
    plt.subplot(3, 1, 2)
    D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='log')
    plt.colorbar(format='%+2.0f dB')
    plt.title(f'Spectrogram - {speaker_name}')
    # Plot the MFCCs
    plt.subplot(3, 1, 3)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    librosa.display.specshow(mfccs, x_axis='time')
    plt.colorbar()
    plt.title(f'MFCCs - {speaker_name}')
    plt.tight_layout()
    plt.show()
# Paths to the combined audio files
audio_paths = [
    '/kaggle/working/combined_files/Benjamin_Netanyau_combined.wav',
    '/kaggle/working/combined_files/Nelson_Mandela_combined.wav'
]
# Plot features for each audio file
for audio_path in audio_paths:
    plot_audio_features(audio_path)
# Install speechbrain for x-vector extraction and tqdm for a progress bar
!pip install speechbrain tqdm
Successfully installed hyperpyyaml-1.2.2 speechbrain-1.0.3
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.callbacks import EarlyStopping
# Set the parent directory for speaker folders
parent_dir = "/kaggle/input/speaker-recognition-dataset/16000_pcm_speeches"
# List of speaker folders
speaker_folders = [
    "Benjamin_Netanyau",
    "Jens_Stoltenberg",
    "Julia_Gillard",
    "Magaret_Tarcher",
    "Nelson_Mandela"
]
def extract_mfcc_features(parent_dir, speaker_folders):
    features = []
    labels = []
    for i, speaker_folder in enumerate(speaker_folders):
        speaker_folder_path = os.path.join(parent_dir, speaker_folder)
        for filename in os.listdir(speaker_folder_path):
            if filename.endswith(".wav"):
                file_path = os.path.join(speaker_folder_path, filename)
                # Load audio file, ensure 1 second duration
                audio, sr = librosa.load(file_path, sr=None, duration=1)
                # Extract 13 MFCCs
                mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
                # Standardize features for each file
                mfccs = StandardScaler().fit_transform(mfccs)
                # Append the transposed MFCCs (shape: [timesteps, features])
                features.append(mfccs.T)
                labels.append(i)  # Append the numeric label
    return np.array(features), np.array(labels)
# Extract features and labels
X, y = extract_mfcc_features(parent_dir, speaker_folders)
print(f"Features shape (Samples, Timesteps, MFCCs): {X.shape}")
print(f"Labels shape (Samples,): {y.shape}")
Features shape (Samples, Timesteps, MFCCs): (7501, 32, 13) Labels shape (Samples,): (7501,)
# The labels are already integers (0-4); fit the encoder on them, then
# replace its classes with the speaker names so that inverse_transform
# maps each integer back to the corresponding speaker name.
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
label_encoder.classes_ = np.array(speaker_folders)
# Split the data into training (70%) and temporary (30%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y_encoded, test_size=0.3, random_state=42, stratify=y_encoded)
# Split the temporary data into validation (50% of 30% = 15%) and test (50% of 30% = 15%)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
print(f"Training data shape: {X_train.shape}")
print(f"Validation data shape: {X_val.shape}")
print(f"Test data shape: {X_test.shape}")
Training data shape: (5250, 32, 13) Validation data shape: (1125, 32, 13) Test data shape: (1126, 32, 13)
# Define the RNN model
model_rnn = tf.keras.Sequential([
    # LSTM layer to process sequences
    tf.keras.layers.LSTM(128, input_shape=(X_train.shape[1], X_train.shape[2])),
    # Dense hidden layer
    tf.keras.layers.Dense(64, activation='relu'),
    # Output layer with softmax for multi-class classification
    tf.keras.layers.Dense(len(speaker_folders), activation='softmax')
])
# Compile the model
model_rnn.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
model_rnn.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm (LSTM) (None, 128) 72704
dense (Dense) (None, 64) 8256
dense_1 (Dense) (None, 5) 325
=================================================================
Total params: 81285 (317.52 KB)
Trainable params: 81285 (317.52 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
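The parameter counts in the summary can be verified by hand with the standard LSTM formula: four gates, each with weights for the inputs, the recurrent connections, and a bias.

```python
# Sanity-check the summary's parameter counts.
# LSTM: 4 gates, each with weights for the 13 MFCC inputs, the 128 recurrent
# units, and a bias -> 4 * 128 * (13 + 128 + 1).
n_in, n_units, n_hidden, n_classes = 13, 128, 64, 5

lstm_params = 4 * n_units * (n_in + n_units + 1)
dense_params = n_units * n_hidden + n_hidden
out_params = n_hidden * n_classes + n_classes

print(lstm_params, dense_params, out_params,
      lstm_params + dense_params + out_params)
# 72704 8256 325 81285 -- matching the model summary above
```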
# Define the EarlyStopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
# Train the model
history = model_rnn.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        epochs=20,
                        batch_size=32,
                        callbacks=[early_stopping])
Epoch 1/20 165/165 [==============================] - 4s 8ms/step - loss: 0.5592 - accuracy: 0.7796 - val_loss: 0.3392 - val_accuracy: 0.8640 Epoch 2/20 165/165 [==============================] - 1s 5ms/step - loss: 0.2275 - accuracy: 0.9208 - val_loss: 0.2067 - val_accuracy: 0.9271 Epoch 3/20 165/165 [==============================] - 1s 5ms/step - loss: 0.1270 - accuracy: 0.9558 - val_loss: 0.1577 - val_accuracy: 0.9431 Epoch 4/20 165/165 [==============================] - 1s 5ms/step - loss: 0.1393 - accuracy: 0.9541 - val_loss: 0.1316 - val_accuracy: 0.9529 Epoch 5/20 165/165 [==============================] - 1s 5ms/step - loss: 0.1004 - accuracy: 0.9661 - val_loss: 0.1090 - val_accuracy: 0.9582 Epoch 6/20 165/165 [==============================] - 1s 5ms/step - loss: 0.0843 - accuracy: 0.9733 - val_loss: 0.1816 - val_accuracy: 0.9369 Epoch 7/20 165/165 [==============================] - 1s 5ms/step - loss: 0.0601 - accuracy: 0.9770 - val_loss: 0.1598 - val_accuracy: 0.9493 Epoch 8/20 165/165 [==============================] - 1s 5ms/step - loss: 0.0556 - accuracy: 0.9813 - val_loss: 0.1753 - val_accuracy: 0.9467
# Plot training vs validation loss
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('RNN Model Training and Validation Loss')
plt.legend()
plt.show()
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
import seaborn as sns
# Predict probabilities on the test set
y_pred_probabilities_rnn = model_rnn.predict(X_test)
# Get the class with the highest probability
y_pred_rnn = np.argmax(y_pred_probabilities_rnn, axis=1)
# Decode labels back to original speaker names
y_test_decoded = label_encoder.inverse_transform(y_test)
y_pred_decoded_rnn = label_encoder.inverse_transform(y_pred_rnn)
# Calculate and print metrics
accuracy_rnn = accuracy_score(y_test_decoded, y_pred_decoded_rnn)
f1_rnn = f1_score(y_test_decoded, y_pred_decoded_rnn, average='weighted')
print(f"MFCC + RNN Model Test Accuracy: {accuracy_rnn * 100:.2f}%")
print(f"MFCC + RNN Model Weighted F1 Score: {f1_rnn:.4f}")
# Plot the confusion matrix
conf_matrix_rnn = confusion_matrix(y_test_decoded, y_pred_decoded_rnn, labels=speaker_folders)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_rnn, annot=True, fmt="d", cmap="Blues",
            xticklabels=speaker_folders, yticklabels=speaker_folders)
plt.xticks(rotation=45, ha="right")
plt.title("Confusion Matrix: MFCC + RNN Model")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
36/36 [==============================] - 0s 2ms/step MFCC + RNN Model Test Accuracy: 96.45% MFCC + RNN Model Weighted F1 Score: 0.9642
# Install speechbrain for x-vector extraction and tqdm for a progress bar
!pip install speechbrain tqdm -q
import torch
from tqdm import tqdm
from speechbrain.inference import EncoderClassifier  # 'speechbrain.pretrained' is deprecated since SpeechBrain 1.0
from sklearn.linear_model import LogisticRegression
from kaggle_secrets import UserSecretsClient
from huggingface_hub import login
# Get the token from Kaggle secrets
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HF_TOKEN")
# Log in to Hugging Face
login(token=hf_token)
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well. Token is valid (permission: fineGrained). Your token has been saved to /root/.cache/huggingface/token Login successful
# We use tqdm.notebook.tqdm for a stable progress bar in Jupyter
from tqdm.notebook import tqdm
# We need to re-scan the file paths and labels
parent_dir = "/kaggle/input/speaker-recognition-dataset/16000_pcm_speeches"
speaker_folders = [
    "Benjamin_Netanyau",
    "Jens_Stoltenberg",
    "Julia_Gillard",
    "Magaret_Tarcher",
    "Nelson_Mandela"
]
file_paths = []
labels_xvec = []
for i, speaker_folder in enumerate(speaker_folders):
    speaker_folder_path = os.path.join(parent_dir, speaker_folder)
    for filename in os.listdir(speaker_folder_path):
        if filename.endswith(".wav"):
            file_paths.append(os.path.join(speaker_folder_path, filename))
            labels_xvec.append(i)  # Use integer labels
print(f"Total files found: {len(file_paths)}")
# Load the pretrained x-vector model. This assumes the VoxCeleb-trained
# checkpoint on Hugging Face, which produces the 512-dim embeddings seen below.
xvector_extractor = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_models/spkrec-xvect-voxceleb"
)
# --- Extract x-vectors ---
all_xvectors = []
# This loop will show a clean progress bar
for f_path in tqdm(file_paths, desc="Extracting x-vectors"):
    # load_audio() resamples to 16 kHz and returns only the signal
    signal = xvector_extractor.load_audio(f_path)
    # Extract x-vector (embedding)
    # The model returns a 3D tensor [batch, 1, embedding_dim], so we squeeze it
    xvec = xvector_extractor.encode_batch(signal)
    xvec = xvec.squeeze().cpu().numpy()
    all_xvectors.append(xvec)
# Convert list of arrays to a 2D NumPy array (Samples, Embedding_Dim)
X_xvec = np.array(all_xvectors)
y_xvec = np.array(labels_xvec)
print(f"\nx-vector features shape: {X_xvec.shape}")
print(f"x-vector labels shape: {y_xvec.shape}")
Total files found: 7501
Extracting x-vectors: 0%| | 0/7501 [00:00<?, ?it/s]
x-vector features shape: (7501, 512) x-vector labels shape: (7501,)
from sklearn.model_selection import train_test_split
# Split x-vector data
X_train_xvec, X_temp_xvec, y_train_xvec, y_temp_xvec = train_test_split(
    X_xvec, y_xvec, test_size=0.3, random_state=42, stratify=y_xvec
)
X_val_xvec, X_test_xvec, y_val_xvec, y_test_xvec = train_test_split(
    X_temp_xvec, y_temp_xvec, test_size=0.5, random_state=42, stratify=y_temp_xvec
)
print(f"Training x-vectors shape: {X_train_xvec.shape}")
print(f"Validation x-vectors shape: {X_val_xvec.shape}")
print(f"Test x-vectors shape: {X_test_xvec.shape}")
Training x-vectors shape: (5250, 512) Validation x-vectors shape: (1125, 512) Test x-vectors shape: (1126, 512)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# 1. Scale the x-vector features
scaler_xvec = StandardScaler()
X_train_scaled_xvec = scaler_xvec.fit_transform(X_train_xvec)
X_test_scaled_xvec = scaler_xvec.transform(X_test_xvec)
# 2. Train a simple backend classifier
print("Training backend classifier (Logistic Regression)...")
backend_model = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)
backend_model.fit(X_train_scaled_xvec, y_train_xvec)
print("Backend classifier trained.")
Training backend classifier (Logistic Regression)... Backend classifier trained.
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Predict on the test set
y_pred_xvec = backend_model.predict(X_test_scaled_xvec)
# We can re-use the label_encoder from Part 1 to get speaker names
y_test_decoded_xvec = label_encoder.inverse_transform(y_test_xvec)
y_pred_decoded_xvec = label_encoder.inverse_transform(y_pred_xvec)
# Calculate and print metrics
accuracy_xvec = accuracy_score(y_test_decoded_xvec, y_pred_decoded_xvec)
f1_xvec = f1_score(y_test_decoded_xvec, y_pred_decoded_xvec, average='weighted')
print(f"x-Vector Model Test Accuracy: {accuracy_xvec * 100:.2f}%")
print(f"x-Vector Model Weighted F1 Score: {f1_xvec:.4f}")
# Plot the confusion matrix
conf_matrix_xvec = confusion_matrix(y_test_decoded_xvec, y_pred_decoded_xvec, labels=speaker_folders)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_xvec, annot=True, fmt="d", cmap="Blues",
            xticklabels=speaker_folders, yticklabels=speaker_folders)
plt.xticks(rotation=45, ha="right")
plt.title("Confusion Matrix: x-Vector Model")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
x-Vector Model Test Accuracy: 92.10% x-Vector Model Weighted F1 Score: 0.9212
import matplotlib.pyplot as plt
# These variables should be in memory from Cell 7 and Cell 13
# accuracy_rnn = ...
# accuracy_xvec = ...
model_names = ['MFCC + RNN', 'x-Vector + Classifier']
accuracies = [accuracy_rnn, accuracy_xvec]
# Create the bar plot
plt.figure(figsize=(8, 6))
bars = plt.bar(model_names, accuracies, color=['#007bff', '#28a745']) # Blue and Green
# Add labels and title
plt.ylabel('Test Accuracy Score')
plt.title('Speaker Recognition Model Comparison')
plt.ylim(0.0, 1.05) # Set y-axis from 0% to 105%
# Add the accuracy values on top of the bars for clarity
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2.0, yval + 0.01, f'{yval*100:.2f}%', ha='center', va='bottom')
plt.show()
Final Observation¶
The MFCC+RNN model (96.45% accuracy) outperformed the x-vector model (92.10%). This suggests that, for these short clips and this fixed set of five speakers, the temporal patterns learned directly by the RNN were a stronger signal than the pre-trained x-vector embeddings. Note, however, that the x-vector pipeline produces general-purpose embeddings and could be applied to speakers unseen during training, whereas the RNN classifier is limited to the five speakers it was trained on.